Statistical models for unsupervised, semi-supervised and supervised transliteration mining

Author

Hassan Sajjad

Abstract

We present a generative model that efficiently mines transliteration pairs in a consistent fashion in three settings: unsupervised, semi-supervised, and supervised transliteration mining. The model interpolates two sub-models, one for the generation of transliteration pairs and one for the generation of non-transliteration pairs (i.e., noise). The model is trained on noisy unlabelled data using the EM algorithm. During training, the transliteration sub-model learns to generate transliteration pairs, while the fixed non-transliteration sub-model generates the noise pairs. After training, the unlabelled data is disambiguated based on the posterior probabilities of the two sub-models. We evaluate our transliteration mining system on data from a transliteration mining shared task and on parallel corpora. For three out of four language pairs, our system outperforms all semi-supervised and supervised systems that participated in the NEWS 2010 shared task. On word pairs extracted from parallel corpora in which fewer than 2% are transliteration pairs, our system achieves up to 86.7% F-measure with 77.9% precision and 97.8% recall.
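The interpolation and posterior-based disambiguation described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes each word pair already comes with a score from a transliteration sub-model and a score from a noise sub-model, holds both scores fixed, and uses EM only to estimate the interpolation weight (in the paper, the transliteration sub-model itself is also re-estimated during training). The names `posterior_translit`, `em_noise_weight`, and `noise_weight`, the toy scores, and the 0.5 decision threshold are all hypothetical choices made for illustration.

```python
from typing import List, Tuple


def posterior_translit(p_translit: float, p_noise: float, noise_weight: float) -> float:
    """Posterior probability that a pair was generated by the transliteration sub-model.

    Assumes both scores are strictly positive so the denominator is non-zero.
    """
    translit_mass = (1.0 - noise_weight) * p_translit
    noise_mass = noise_weight * p_noise
    return translit_mass / (translit_mass + noise_mass)


def em_noise_weight(
    pairs: List[Tuple[float, float]],
    noise_weight: float = 0.5,
    iterations: int = 50,
) -> Tuple[float, List[float]]:
    """Estimate the interpolation weight with EM, keeping sub-model scores fixed.

    `pairs` holds (transliteration score, noise score) for each word pair.
    This simplification only learns the mixing weight (an assumption of the sketch).
    """
    for _ in range(iterations):
        # E-step: posterior of the transliteration sub-model for every pair.
        posteriors = [posterior_translit(pt, pn, noise_weight) for pt, pn in pairs]
        # M-step: the noise weight is the average posterior mass assigned to noise.
        noise_weight = 1.0 - sum(posteriors) / len(posteriors)
    # Recompute posteriors with the final weight for disambiguation.
    final = [posterior_translit(pt, pn, noise_weight) for pt, pn in pairs]
    return noise_weight, final


# Toy scores (invented for illustration): two likely transliterations, two likely noise pairs.
scores = [(0.9, 0.1), (0.8, 0.1), (0.01, 0.5), (0.02, 0.6)]
weight, posteriors = em_noise_weight(scores)
# Keep the pairs whose transliteration posterior exceeds the assumed 0.5 threshold.
mined = [i for i, p in enumerate(posteriors) if p > 0.5]
```

In this sketch a pair is kept as a transliteration when its posterior under the transliteration sub-model exceeds that of the noise sub-model, which is one plausible reading of "disambiguated based on the posterior probabilities".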


Similar resources

A Statistical Model for Unsupervised and Semi-supervised Transliteration Mining

We propose a novel model to automatically extract transliteration pairs from parallel corpora. Our model is efficient, language pair independent and mines transliteration pairs in a consistent fashion in both unsupervised and semi-supervised settings. We model transliteration mining as an interpolation of transliteration and non-transliteration sub-models. We evaluate on NEWS 2010 shared task d...


Semi-supervised transliteration mining from parallel and comparable corpora

Transliteration is the process of writing a word (mainly a proper noun) from one language in the alphabet of another language. This process requires mapping the pronunciation of the word from the source language to the closest possible pronunciation in the target language. In this paper we introduce a new semi-supervised transliteration mining method for parallel and comparable corpora. The metho...


Statistical machine learning for data mining and collaborative multimedia retrieval

of thesis entitled: Statistical Machine Learning for Data Mining and Collaborative Multimedia Retrieval. Submitted by HOI, Chu Hong (Steven) for the degree of Doctor of Philosophy at The Chinese University of Hong Kong in September 2006. Statistical machine learning techniques have been widely applied in data mining and multimedia information retrieval. While traditional methods, such as supervis...


Semi-Supervised Lexicon Mining from Parenthetical Expressions in Monolingual Web Pages

This paper presents a semi-supervised learning framework for mining Chinese-English lexicons from a large number of Chinese Web pages. The work is motivated by the observation that many Chinese neologisms are accompanied by their English translations in parentheses. We classify parenthetical translations into bilingual abbreviations, transliterations, and translations. A frequency-ba...


A Semi-supervised Ensemble Approach for Mining Data Streams

There are many challenges in mining data streams, such as infinite length, evolving nature and lack of labeled instances. Accordingly, a semi-supervised ensemble approach for mining data streams is presented in this paper. Data streams are divided into data chunks to deal with the infinite length. An ensemble classification model E is trained with existing labeled data chunks and decision bound...



Journal:

Computational Linguistics

Volume 43, issue

Pages -

Publication date: 2013
